feat: Dubai villa lead scraper + Apify bridge + Sheets sync #1

vishnu-madhavan-git wants to merge 1 commit into main from
Conversation
- agents/dubai_villa_scraper.py: stdlib-only scraper for PropertyFinder + Bayut
  - Extracts UAE phone numbers, owner names, areas, prices
  - Deduplicates by phone number
  - Outputs JSON to data/state/villa_leads.json
- agents/apify_dubai_scraper.py: Apify actor bridge (faster path)
  - Uses redoubtable_bubble/dubai-real-estate-scraper actor
  - Handles anti-bot automatically
  - Requires APIFY_TOKEN in .env
- core/leads-bridge.js: syncs villa_leads.json to Google Sheets CRM
  - Deduplicates synced leads
  - Uses existing SheetsService pattern

Use case: IXR interior design client acquisition from Dubai villa owners
📝 Walkthrough

A Dubai villa lead scraping system is added with multiple collection options: PropertyFinder/Bayut (no API key required) and Apify-based scraping. Includes deduplication logic and JSON persistence at data/state/villa_leads.json.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
participant DLS as Dubai Villa Scraper
participant PF as PropertyFinder/Bayut
participant Store as JSON Store
participant Dedup as Deduplicator
DLS->>PF: Fetch listings (area, pagination)
PF-->>DLS: HTML response
DLS->>DLS: Parse & extract leads<br/>(phone, price, URL, area)
DLS->>Store: Load existing leads
Store-->>DLS: villa_leads.json
DLS->>Dedup: Check for duplicates<br/>(by phone number)
Dedup-->>DLS: New leads, dup count
DLS->>Store: Append & save updated list
Store-->>DLS: Persisted
DLS->>DLS: Output summary JSON<br/>(new, total, duplicates)
```

```mermaid
sequenceDiagram
participant Bridge as Leads Bridge
participant VL as villa_leads.json
participant SL as synced_leads.json
participant Sheets as Google Sheets
Bridge->>VL: Read all villa leads
VL-->>Bridge: Lead array
Bridge->>SL: Load synced phones
SL-->>Bridge: Phone list
Bridge->>Bridge: Filter new leads<br/>(exclude synced)
Bridge->>Sheets: Initialize API
loop For each new lead
Bridge->>Sheets: Sync lead<br/>(name, area, price, URL)
Sheets-->>Bridge: Success/Failure
end
Bridge->>SL: Append synced phones
SL-->>Bridge: Updated
Bridge->>Bridge: Return sync summary
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```js
}

async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
```
🔴 new (require('./sheets'))() fails because sheets.js exports a singleton instance, not a class
core/sheets.js:74 exports module.exports = new SheetsService() — a singleton instance, not the SheetsService class. In core/leads-bridge.js:30, new (require('./sheets'))() attempts to call new on that instance, which will throw TypeError: (intermediate value) is not a constructor at runtime. The function syncLeadsToSheets() can never succeed.
```diff
- const sheetsService = new (require("./sheets"))();
+ const sheetsService = require("./sheets");
```
```js
// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```
🔴 Failed-to-sync leads are incorrectly marked as synced
At core/leads-bridge.js:58, all unsynced leads are unconditionally added to the synced list, even if sheetsService.addLead() threw an error for some of them (caught at line 52). This means leads that failed to sync to Google Sheets will never be retried on subsequent runs, since they'll already be in synced_leads.json. The code should only mark successfully synced leads.
Prompt for agents
In core/leads-bridge.js, lines 43-59: Track which leads were successfully synced in an array (e.g., `successfullySynced`), push each lead to it inside the try block after the synced++ increment (around line 51), and then on line 58, use that array instead of `unsynced` when building `newSynced`. This ensures only leads that were actually synced to Google Sheets get marked in synced_leads.json.
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
    log(f"Fetched {len(items)} items from Apify")
    return items
```
🔴 Apify dataset items API returns a JSON array, not a {data: {items: []}} object
The Apify API endpoint /v2/actor-runs/{runId}/dataset/items returns a raw JSON array of items, not an object with a data.items wrapper. At agents/apify_dubai_scraper.py:109-110, apify_request parses the response with json.loads() which yields a Python list. Then result.get("data", {}).get("items", []) will raise AttributeError: 'list' object has no attribute 'get' because lists don't have .get(). This means fetch_results() will always crash after a successful actor run.
```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
     result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    if isinstance(result, list):
+        items = result
+    else:
+        items = result.get("data", {}).get("items", [])
     log(f"Fetched {len(items)} items from Apify")
     return items
```
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
```
🔴 Bayut URL construction double-encodes area: urllib.parse.quote() then .replace(' ', '-') is a no-op
At agents/dubai_villa_scraper.py:219-220, when area is provided, encoded_area = urllib.parse.quote(area) converts spaces to %20 (e.g., "Palm Jumeirah" → "Palm%20Jumeirah"). The subsequent .lower().replace(' ', '-') on line 220 is then a no-op since there are no literal spaces left, producing a URL like .../villa/palm%20jumeirah/ instead of the expected .../villa/palm-jumeirah/. This will likely return a 404 or wrong results from Bayut.
```diff
- encoded_area = urllib.parse.quote(area) if area else "dubai"
- url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
+ area_slug = area.lower().replace(" ", "-") if area else "dubai"
+ url = f"https://www.bayut.com/for-rent/villa/{area_slug}/?owner_only=1"
```
Pull request overview
Adds an automated lead-generation pipeline for Dubai villa listings, including two Python-based scrapers (stdlib + Apify) and a Node bridge that syncs collected leads into the existing Google Sheets CRM integration.
Changes:
- Add agents/dubai_villa_scraper.py to scrape PropertyFinder/Bayut and persist leads to data/state/villa_leads.json.
- Add agents/apify_dubai_scraper.py to pull owner-contact leads via an Apify actor and persist them in the same state file.
- Add core/leads-bridge.js plus README updates to sync stored leads into Google Sheets.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| core/leads-bridge.js | New bridge that loads leads from data/state and pushes unsynced ones to Google Sheets. |
| agents/requirements.txt | Updates comments describing scraper dependency expectations. |
| agents/dubai_villa_scraper.py | New stdlib-only scraper that collects and deduplicates leads into state JSON. |
| agents/apify_dubai_scraper.py | New Apify API bridge that runs an actor, normalizes results, and writes leads to state JSON. |
| README.md | Documents how to run both scrapers and how to sync leads to Sheets. |
```js
}

async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
```
core/sheets.js exports an already-instantiated SheetsService (module.exports = new SheetsService()), but this code treats it like a constructor (new (require("./sheets"))()). That will throw at runtime. Use the exported instance directly and call init()/addLead() on it.
```diff
- const sheetsService = new (require("./sheets"))();
+ const sheetsService = sheets;
```
```js
// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```
This marks all unsynced leads as synced even if some addLead calls fail (errors are caught but the lead is still appended to synced_leads.json). This will prevent retries and can permanently drop leads from being synced. Track only successfully synced leads (or update the synced file incrementally on each success) before writing synced_leads.json.
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
```
Bayut URL construction mixes quote() with a slug-style path. For areas like "Palm Jumeirah", urllib.parse.quote(area) yields Palm%20Jumeirah, so the URL becomes /villa/palm%20jumeirah/… rather than the expected hyphenated slug. Build the slug first (lowercase + replace spaces with '-') and then quote (or avoid quoting if the slug is URL-safe) to ensure the listings page resolves correctly.
```diff
- encoded_area = urllib.parse.quote(area) if area else "dubai"
- url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
+ if area:
+     # Build a Bayut-style slug: lowercase and hyphen-separated, then URL-encode if needed.
+     area_slug = area.strip().lower().replace(" ", "-")
+     encoded_area = urllib.parse.quote(area_slug, safe="-")
+ else:
+     encoded_area = "dubai"
+ url = f"https://www.bayut.com/for-rent/villa/{encoded_area}/?owner_only=1"
```
```js
const sheets = require("./sheets");
```
const sheets = require("./sheets"); is unused (and syncLeadsToSheets requires the module again later). After fixing the SheetsService instantiation, keep a single import and use it to avoid duplication and dead code.
```python
area_slug = area.lower().replace(" ", "-") if area else ""
base_url = "https://www.propertyfinder.ae/en/search?c=2&t=1&fu=1&rp=y"
if area_slug:
```
area_slug is computed but never used. Remove it or use it consistently when building the query/location parameter to keep the scraping logic clear.
```diff
- area_slug = area.lower().replace(" ", "-") if area else ""
  base_url = "https://www.propertyfinder.ae/en/search?c=2&t=1&fu=1&rp=y"
- if area_slug:
+ if area:
```
```python
# Simpler extraction - look for villa data in JSON-LD or meta tags
# PropertyFinder embeds listing data as JSON
json_matches = re.findall(r'window\.__INITIAL_STATE__\s*=\s*({.*?});\s*</script>', html, re.DOTALL)
if not json_matches:
    json_matches = re.findall(r'"properties"\s*:\s*(\[.*?\])', html, re.DOTALL)
```
json_matches is computed but never used. If the intent is to parse window.__INITIAL_STATE__ for structured listing data (more robust than regex), consider implementing that; otherwise remove this block to reduce dead code.
```diff
- # Simpler extraction - look for villa data in JSON-LD or meta tags
- # PropertyFinder embeds listing data as JSON
- json_matches = re.findall(r'window\.__INITIAL_STATE__\s*=\s*({.*?});\s*</script>', html, re.DOTALL)
- if not json_matches:
-     json_matches = re.findall(r'"properties"\s*:\s*(\[.*?\])', html, re.DOTALL)
```
```python
def fetch_url(url: str, headers: dict = None) -> str | None:
    """Fetch a URL with retry logic."""
    default_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
```
fetch_url uses the PEP 604 union type (str | None), which requires Python 3.10+. The repo README currently targets “system python” without a minimum version, and other agents avoid this syntax. Consider switching to Optional[str] (and similar) or explicitly documenting/enforcing Python >= 3.10 for agents.
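For reference, a minimal sketch of the same signatures rewritten with `typing.Optional` so they run on pre-3.10 interpreters (bodies elided; only the annotations change, and `list[dict]` would similarly become `typing.List[dict]` for anything before 3.9):

```python
from typing import Optional

def fetch_url(url: str, headers: Optional[dict] = None) -> Optional[str]:
    """Fetch a URL with retry logic (body unchanged)."""
    ...

def run_actor(token: str, area: str, max_items: int,
              property_type: str = "villa") -> Optional[str]:
    """Start the Apify actor run and return run ID (body unchanged)."""
    ...
```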
```python
    for p in phones:
        clean = re.sub(r'[\s\-]', '', p)
        if clean not in normalized:
            normalized.append(clean)
    return normalized
```
deduplicate() keys off lead["phone"], but normalization here only strips spaces/dashes. The same UAE number can appear as 00971..., +971..., or local 05... and won’t dedupe correctly. Normalize to a single canonical format (e.g. E.164 +971...) before adding to normalized.
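A minimal sketch of such a normalizer, assuming leads carry UAE numbers only in the three spellings named above (`normalize_uae_phone` is a hypothetical helper, not in the PR):

```python
import re

def normalize_uae_phone(raw: str) -> str:
    """Hypothetical helper: map 00971..., 971..., and local 05... spellings
    to one canonical E.164 form (+971...) before deduplication."""
    digits = re.sub(r"[^\d+]", "", raw)       # strip spaces, dashes, parens
    if digits.startswith("+971"):
        return digits
    if digits.startswith("00971"):
        return "+971" + digits[5:]
    if digits.startswith("971"):
        return "+" + digits
    if digits.startswith("0"):                # local format, e.g. 050 123 4567
        return "+971" + digits[1:]
    return digits                             # unrecognized: leave as-is
```

Deduplication on both the scraper and bridge sides would then key off `normalize_uae_phone(lead["phone"])` instead of the raw string.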
```python
def run_actor(token: str, area: str, max_items: int, property_type: str = "villa") -> str | None:
    """Start the Apify actor run and return run ID."""
```
This file also uses PEP 604 union types in return annotations (e.g. str | None, dict | None), which require Python 3.10+. If agents are intended to run on an unspecified “system python”, consider using Optional[...]/Union[...] or documenting/enforcing Python >= 3.10.
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
```
fetch_results assumes apify_request returns an object with data.items, but the dataset items endpoint commonly returns a raw JSON array. If a list is returned, result.get(...) will throw and the scraper will crash. Adjust apify_request/fetch_results to handle a list response (or request a response shape that’s always an object).
```diff
-    items = result.get("data", {}).get("items", [])
+    # The Apify dataset items endpoint may return either:
+    # - a raw JSON array of items, or
+    # - an object that wraps items under data.items.
+    if isinstance(result, list):
+        items = result
+    elif isinstance(result, dict):
+        data = result.get("data", result)
+        if isinstance(data, dict):
+            items = data.get("items", [])
+        elif isinstance(data, list):
+            items = data
+        else:
+            items = []
+    else:
+        items = []
```
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agents/apify_dubai_scraper.py`:
- Around line 7-12: The script documents that APIFY_TOKEN can live in a .env but
only reads os.environ; fix by loading .env at module start (before any env
access) using python-dotenv: add "from dotenv import load_dotenv" and call
"load_dotenv()" near the top of the file before retrieving APIFY_TOKEN (and
similarly before the env checks around the block referenced at lines ~175-180).
Alternatively, update the CLI docs to remove the .env claim — but the preferred
fix is to call load_dotenv() before accessing APIFY_TOKEN so os.environ sees
variables from .env.
- Around line 148-158: Concurrent runs overwrite the shared JSON leads store
because load_existing_leads and save_leads read the whole file, mutate
in-memory, and rewrite it; change these functions to perform concurrency-safe
updates (e.g., acquire a file lock around read-modify-write or switch to an
append-only/JSONL writer) so overlapping scrapers don't lose data. Specifically,
update load_existing_leads, save_leads and any callers that append to LEADS_FILE
so they obtain an exclusive lock on LEADS_FILE (or open it in append mode for
JSONL) before reading/writing, merge new leads into the existing set safely, and
use atomic replace/rename when writing to avoid partial writes; ensure the same
locking/format is used by agents/dubai_villa_scraper.py to keep behavior
consistent.
- Around line 107-111: The fetch_results function assumes apify_request returned
a dict and calls .get on it, but the /actor-runs/{run_id}/dataset/items endpoint
returns a raw list; update fetch_results to handle both shapes by checking the
type of the response from apify_request (called in fetch_results) and set items
= result if it's a list, otherwise fall back to result.get("data",
{}).get("items", []); keep the log(f"Fetched {len(items)} items from Apify") and
ensure the function returns the items list.
In `@agents/dubai_villa_scraper.py`:
- Around line 49-59: The read-modify-write on LEADS_FILE in
load_existing_leads/save_leads causes lost-update races; wrap the entire
sequence that reads, merges new leads, and writes back with a file-level lock
(e.g., using a FileLock/portalocker) so concurrent agents serialize access to
LEADS_FILE, perform a read->merge-by-unique-id->write under the same lock, and
write atomically (write to temp + os.replace) to avoid corruption; apply the
same locking/merge pattern to the other lead-write logic referenced around the
second block (lines 302-319) and reference the functions load_existing_leads and
save_leads when locating the code to change.
- Around line 219-237: The scraper only fetches the first Bayut search page so
it under-collects when max_results is larger; modify the listing collection
around encoded_area/url/fetch_url/listing_urls to paginate (add a page
parameter, e.g. append "&page={page}" to the existing query), loop fetching
pages starting at page=1, extract and accumulate unique listing URLs into
listing_urls until you have >= max_results or no new URLs are returned, and then
slice to max_results before iterating to create leads; ensure you break on empty
pages and update logs to show pages fetched.
In `@core/leads-bridge.js`:
- Around line 43-59: The code currently appends all unsynced leads to persistent
storage regardless of whether sheetsService.addLead succeeded; change the logic
to only persist leads that were actually synced by collecting successful leads
(e.g., push to a local array inside the try block or incrementally build
syncedLeads) and then call saveSynced([...loadSynced(), ...syncedLeads]) instead
of saveSynced([...loadSynced(), ...unsynced]); keep the existing try/catch
around sheetsService.addLead and error logging for failed attempts so transient
failures remain retryable.
- Around line 29-31: The syncLeadsToSheets function instantiates the Sheets
module with new (require("./sheets"))() which is incorrect because
core/sheets.js exports a singleton; change the code to require/import the
exported sheets instance (the exported symbol from "./sheets") and use that
instance (e.g., sheets.init()) instead of creating a new object; if you actually
need the class, update core/sheets.js to also export the class constructor and
require that specific export, but by default reuse the exported sheets singleton
rather than calling new on require("./sheets").
In `@README.md`:
- Around line 57-70: Update the README to explicitly require Python 3.10+ for
running the scraper commands: add a short note above the examples stating
"Requires Python 3.10+" (or "Python 3.10 or later") because the agents
(agents/dubai_villa_scraper.py and agents/apify_dubai_scraper.py) use PEP 604
union syntax (e.g., str | None) and newer parameterized built-ins; also keep the
APIFY_TOKEN note for the Apify scraper so users know environment setup
requirements.
📒 Files selected for processing (5)
- README.md
- agents/apify_dubai_scraper.py
- agents/dubai_villa_scraper.py
- agents/requirements.txt
- core/leads-bridge.js
```python
This is the FAST path - uses Apify's actor which handles anti-bot measures.
Requires APIFY_TOKEN in .env

Usage:
    python agents/apify_dubai_scraper.py
    python agents/apify_dubai_scraper.py --area "Palm Jumeirah" --max 100
```
Load .env here or change the documented CLI contract.
The docstring says APIFY_TOKEN can live in .env, but this entry point only checks os.environ. Running python agents/apify_dubai_scraper.py as documented will still hit APIFY_TOKEN not set unless the caller exported the variable first.
Also applies to: 175-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/apify_dubai_scraper.py` around lines 7 - 12, The script documents that
APIFY_TOKEN can live in a .env but only reads os.environ; fix by loading .env at
module start (before any env access) using python-dotenv: add "from dotenv
import load_dotenv" and call "load_dotenv()" near the top of the file before
retrieving APIFY_TOKEN (and similarly before the env checks around the block
referenced at lines ~175-180). Alternatively, update the CLI docs to remove the
.env claim — but the preferred fix is to call load_dotenv() before accessing
APIFY_TOKEN so os.environ sees variables from .env.
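If adding the python-dotenv dependency is undesirable (the scrapers are otherwise stdlib-only), a rough stdlib fallback could look like this (a sketch; it ignores quoting edge cases and `export` prefixes, and `load_env_file` is a hypothetical name):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines only, comments and blanks skipped.
    Existing environment variables win over file values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))

# Call before any env access:
load_env_file()
token = os.environ.get("APIFY_TOKEN")
```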
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
    log(f"Fetched {len(items)} items from Apify")
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, find and inspect the apify_dubai_scraper.py file
find . -name "apify_dubai_scraper.py" -type f
```
Repository: vishnu-madhavan-git/automation
🌐 Web query:
Apify API dataset-items endpoint response format /actor-runs/{runId}/dataset/items
💡 Result:
GET /v2/actor-runs/{runId}/dataset/items is the Actor-run "default dataset" items endpoint and it behaves the same as the regular Get dataset items endpoint (same query params + same output formats). (docs.apify.com)

Default response (no format param)
- format=json (default): returns a JSON array of item objects, e.g. [{...},{...}]. (docs.apify.com)

Other response formats (format=...)
- format=jsonl: newline-delimited JSON (one JSON object per line, ending with \n). (docs.apify.com)
- format=csv: CSV text (header\nrow\n...), with optional UTF-8 BOM behavior controlled by bom=0|1. (docs.apify.com)
- Also supported: html, xlsx, xml, rss (returned as the corresponding textual/binary representation). (docs.apify.com)

Pagination metadata (via response headers)
The items response includes these headers:
- X-Apify-Pagination-Offset
- X-Apify-Pagination-Limit
- X-Apify-Pagination-Count
- X-Apify-Pagination-Total
(docs.apify.com)

Common query params that affect the output shape
- offset, limit
- fields (pick only listed fields), omit
- unwind, flatten
- clean (shortcut for skipHidden=true + skipEmpty=true) (docs.apify.com)
Citations:
- 1: https://docs.apify.com/api/v2/actor-run-get
- 2: https://docs.apify.com/api/v2/dataset-items-get
- 3: https://docs.apify.com/api/v2/dataset-items-get
- 4: https://docs.apify.com/api/v2/dataset-items-get
- 5: https://docs.apify.com/api/v2/dataset-items-get
- 6: https://docs.apify.com/api/v2/dataset-items-get
- 7: https://docs.apify.com/api/v2/dataset-items-get
🏁 Script executed:
```bash
# Search for the apify_request function to understand response handling
rg "def apify_request" -A 10 -t py
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Inspect the actual fetch_results function in the file to confirm code and line numbers
cat -n agents/apify_dubai_scraper.py | sed -n '100,120p'
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Also check the full context around the function to understand usage
cat -n agents/apify_dubai_scraper.py | sed -n '1,30p'
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the complete apify_request function to see full implementation
cat -n agents/apify_dubai_scraper.py | sed -n '40,60p'
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Verify if there are any other usages of apify_request to understand the response pattern
rg "apify_request" agents/apify_dubai_scraper.py -B 1 -A 1
```
Repository: vishnu-madhavan-git/automation
Treat the dataset-items response as a raw list.
Apify's /actor-runs/{runId}/dataset/items endpoint returns a JSON array directly (e.g. [{...},{...}]), not a wrapped object. The current code calls .get("data", {}) on the response, which will fail with AttributeError: 'list' object has no attribute 'get'. This differs from other Apify endpoints in this file (e.g. /actor-runs/{run_id}) which return wrapped responses.
Proposed fix
```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
-    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
+    if not isinstance(items, list):
+        log(f"Unexpected dataset response: {items}")
+        return []
     log(f"Fetched {len(items)} items from Apify")
     return items
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    if not isinstance(items, list):
        log(f"Unexpected dataset response: {items}")
        return []
    log(f"Fetched {len(items)} items from Apify")
    return items
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/apify_dubai_scraper.py` around lines 107 - 111, The fetch_results
function assumes apify_request returned a dict and calls .get on it, but the
/actor-runs/{run_id}/dataset/items endpoint returns a raw list; update
fetch_results to handle both shapes by checking the type of the response from
apify_request (called in fetch_results) and set items = result if it's a list,
otherwise fall back to result.get("data", {}).get("items", []); keep the
log(f"Fetched {len(items)} items from Apify") and ensure the function returns
the items list.
```python
def load_existing_leads() -> list:
    if LEADS_FILE.exists():
        try:
            return json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except Exception:
            return []
    return []


def save_leads(leads: list) -> None:
    LEADS_FILE.write_text(json.dumps(leads, indent=2, ensure_ascii=False), encoding="utf-8")
```
Serialize writes to the shared leads store.
This code loads data/state/villa_leads.json, appends in memory, and rewrites the whole file. agents/dubai_villa_scraper.py does the same against the same path, so overlapping runs will drop whichever scraper saves first.
Also applies to: 198-201
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 152-152: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/apify_dubai_scraper.py` around lines 148 - 158, Concurrent runs
overwrite the shared JSON leads store because load_existing_leads and save_leads
read the whole file, mutate in-memory, and rewrite it; change these functions to
perform concurrency-safe updates (e.g., acquire a file lock around
read-modify-write or switch to an append-only/JSONL writer) so overlapping
scrapers don't lose data. Specifically, update load_existing_leads, save_leads
and any callers that append to LEADS_FILE so they obtain an exclusive lock on
LEADS_FILE (or open it in append mode for JSONL) before reading/writing, merge
new leads into the existing set safely, and use atomic replace/rename when
writing to avoid partial writes; ensure the same locking/format is used by
agents/dubai_villa_scraper.py to keep behavior consistent.
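One possible shape for the locked read-merge-write, sketched with stdlib `fcntl` (POSIX-only; `portalocker` or `filelock` would be the cross-platform route). The `phone` dedup key and the `data/state/villa_leads.json` path come from the PR description; `append_leads_atomically` is a hypothetical name:

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path

LEADS_FILE = Path("data/state/villa_leads.json")

def append_leads_atomically(new_leads: list) -> None:
    """Serialize concurrent agents with an exclusive lock, then write the
    merged list to a temp file and atomically replace the store."""
    lock_path = LEADS_FILE.with_suffix(".lock")
    LEADS_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # held until the file closes
        try:
            existing = json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            existing = []
        seen = {lead.get("phone") for lead in existing}
        existing += [l for l in new_leads if l.get("phone") not in seen]
        fd, tmp = tempfile.mkstemp(dir=LEADS_FILE.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(json.dumps(existing, indent=2, ensure_ascii=False))
        os.replace(tmp, LEADS_FILE)           # atomic rename on POSIX
```

Writing the temp file into the same directory keeps `os.replace()` on one filesystem, which is what makes the final rename atomic.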
```python
def load_existing_leads() -> list:
    if LEADS_FILE.exists():
        try:
            return json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except Exception:
            return []
    return []


def save_leads(leads: list) -> None:
    LEADS_FILE.write_text(json.dumps(leads, indent=2, ensure_ascii=False), encoding="utf-8")
```
The shared leads store still has a lost-update race.
Like the Apify agent, this flow rewrites data/state/villa_leads.json via a read-modify-write sequence with no locking. If both agents overlap, the later write drops the earlier agent’s newly found leads.
Also applies to: 302-319
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 53-53: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/dubai_villa_scraper.py` around lines 49 - 59, The read-modify-write on
LEADS_FILE in load_existing_leads/save_leads causes lost-update races; wrap the
entire sequence that reads, merges new leads, and writes back with a file-level
lock (e.g., using a FileLock/portalocker) so concurrent agents serialize access
to LEADS_FILE, perform a read->merge-by-unique-id->write under the same lock,
and write atomically (write to temp + os.replace) to avoid corruption; apply the
same locking/merge pattern to the other lead-write logic referenced around the
second block (lines 302-319) and reference the functions load_existing_leads and
save_leads when locating the code to change.
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"

log(f"Bayut: starting scrape (area={area or 'Dubai'}, max={max_results})")
html = fetch_url(url)

if not html:
    log("Bayut: failed to fetch listings page")
    return leads

# Extract listing links
listing_urls = re.findall(r'"(https://www\.bayut\.com/property/[^"]+)"', html)
if not listing_urls:
    listing_urls = re.findall(r'href="(/property/[^"]+)"', html)
    listing_urls = [f"https://www.bayut.com{u}" for u in listing_urls]

log(f"Bayut: found {len(listing_urls)} listing URLs")

for listing_url in listing_urls[:max_results]:
```
Paginate Bayut results before honoring --max.
This implementation requests exactly one Bayut search page and never advances a page parameter. Any run that needs more leads than that first page exposes will silently under-collect, even though --max suggests otherwise.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/dubai_villa_scraper.py` around lines 219 - 237, The scraper only
fetches the first Bayut search page so it under-collects when max_results is
larger; modify the listing collection around
encoded_area/url/fetch_url/listing_urls to paginate (add a page parameter, e.g.
append "&page={page}" to the existing query), loop fetching pages starting at
page=1, extract and accumulate unique listing URLs into listing_urls until you
have >= max_results or no new URLs are returned, and then slice to max_results
before iterating to create leads; ensure you break on empty pages and update
logs to show pages fetched.
```js
async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
  const initialized = await sheetsService.init();
```
Use the exported Sheets singleton instead of constructing it.
core/sheets.js already exports a service instance. Calling new (require("./sheets"))() here will throw before any sync happens; reuse the imported sheets object or export the class separately.
Proposed fix
```diff
 async function syncLeadsToSheets() {
-  const sheetsService = new (require("./sheets"))();
+  const sheetsService = sheets;
   const initialized = await sheetsService.init();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/leads-bridge.js` around lines 29 - 31, The syncLeadsToSheets function
instantiates the Sheets module with new (require("./sheets"))() which is
incorrect because core/sheets.js exports a singleton; change the code to
require/import the exported sheets instance (the exported symbol from
"./sheets") and use that instance (e.g., sheets.init()) instead of creating a
new object; if you actually need the class, update core/sheets.js to also export
the class constructor and require that specific export, but by default reuse the
exported sheets singleton rather than calling new on require("./sheets").
```js
let synced = 0;
for (const lead of unsynced) {
  try {
    await sheetsService.addLead({
      name: lead.name,
      phone: lead.phone,
      notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
    });
    synced++;
  } catch (err) {
    console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
  }
}

// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```
Only persist leads that actually synced.
saveSynced() currently appends every unsynced lead, even when addLead() threw. A transient Sheets failure will therefore permanently suppress retries for that lead.
Proposed fix
```diff
 let synced = 0;
+const syncedNow = [];
 for (const lead of unsynced) {
   try {
     await sheetsService.addLead({
       name: lead.name,
       phone: lead.phone,
       notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
     });
     synced++;
+    syncedNow.push(lead);
   } catch (err) {
     console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
   }
 }
 // Mark all as synced
-const newSynced = [...loadSynced(), ...unsynced];
+const newSynced = [...loadSynced(), ...syncedNow];
 saveSynced(newSynced);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```js
let synced = 0;
const syncedNow = [];
for (const lead of unsynced) {
  try {
    await sheetsService.addLead({
      name: lead.name,
      phone: lead.phone,
      notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
    });
    synced++;
    syncedNow.push(lead);
  } catch (err) {
    console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
  }
}

// Mark all as synced
const newSynced = [...loadSynced(), ...syncedNow];
saveSynced(newSynced);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/leads-bridge.js` around lines 43 - 59, The code currently appends all
unsynced leads to persistent storage regardless of whether sheetsService.addLead
succeeded; change the logic to only persist leads that were actually synced by
collecting successful leads (e.g., push to a local array inside the try block or
incrementally build syncedLeads) and then call saveSynced([...loadSynced(),
...syncedLeads]) instead of saveSynced([...loadSynced(), ...unsynced]); keep the
existing try/catch around sheetsService.addLead and error logging for failed
attempts so transient failures remain retryable.
````markdown
## Dubai Villa Lead Scraper

Two scraper agents for collecting direct villa owner contacts in Dubai:

### Basic Scraper (no API key needed)
```bash
python agents/dubai_villa_scraper.py --source both --area "Palm Jumeirah" --max 30
```

### Apify Scraper (faster, more reliable)
Requires `APIFY_TOKEN` in `.env`
```bash
python agents/apify_dubai_scraper.py --area "Emirates Hills" --max 100
```
````
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's find the agents directory and list its contents
find . -name "agents" -type d | head -20
git ls-files | grep -E "agents/" | head -20
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check if the agent files exist
ls -la agents/ 2>/dev/null || echo "agents/ directory not found at root"
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Search for these specific files
find . -name "dubai_villa_scraper.py" -o -name "apify_dubai_scraper.py"
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check for version requirements in common config files
cat pyproject.toml 2>/dev/null | grep -i python || echo "No python requirement in pyproject.toml"
cat setup.py 2>/dev/null | grep -i python || echo "No setup.py found"
cat setup.cfg 2>/dev/null | grep -i python || echo "No setup.cfg found"
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the agent files for parameterized built-ins and union syntax
wc -l agents/dubai_villa_scraper.py agents/apify_dubai_scraper.py
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Search for parameterized built-ins (list[...], dict[...], etc.) and union syntax (X | Y)
rg "list\[|dict\[|tuple\[|set\[|\s\|\s" agents/
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the first 50 lines of each agent file for imports and type hints
head -50 agents/dubai_villa_scraper.py
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the apify agent file as well
head -50 agents/apify_dubai_scraper.py
```
Repository: vishnu-madhavan-git/automation
Specify Python 3.10+ requirement for these commands.
The agents use PEP 604 union syntax (str | None) and parameterized built-ins (list[str], dict, etc.), which require Python 3.10+ and 3.9+ respectively. The README shows unqualified python commands, so users with system Python < 3.10 will fail at runtime. Add a note requiring Python 3.10+.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 57 - 70, Update the README to explicitly require
Python 3.10+ for running the scraper commands: add a short note above the
examples stating "Requires Python 3.10+" (or "Python 3.10 or later") because the
agents (agents/dubai_villa_scraper.py and agents/apify_dubai_scraper.py) use PEP
604 union syntax (e.g., str | None) and newer parameterized built-ins; also keep
the APIFY_TOKEN note for the Apify scraper so users know environment setup
requirements.
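Besides the README note, the agents themselves could fail fast on old interpreters with a guard like this (a hypothetical addition, not in the PR):

```python
import sys

# PEP 604 unions (str | None) in annotations raise TypeError on Python < 3.10
# when the def statements execute, so this check must run first: place it at
# the very top of each agent, before any annotated function definitions.
if sys.version_info < (3, 10):
    sys.exit("This scraper requires Python 3.10 or later.")
```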
Summary
Adds automated Dubai villa owner lead collection for IXR interior design client acquisition.
New files
- agents/dubai_villa_scraper.py: stdlib-only scraper (PropertyFinder + Bayut)
- agents/apify_dubai_scraper.py: Apify-powered scraper (faster, handles anti-bot)
- core/leads-bridge.js: syncs leads to Google Sheets CRM

Usage
Sheets sync
Summary by CodeRabbit
New Features
Documentation